ABSTRACT

R-loop is the structure co-transcriptionally formed 
between nascent RNA transcript and DNA template, 
leaving the non-transcribed DNA strand unpaired. 
This structure can be involved in the hyper-mutation 
and dsDNA breaks in mammalian immunoglobulin 
(Ig) genes, oncogenes and neurodegenerative
disease related genes. R-loops have not been 
studied at the genome scale yet. To identify the 
R-loops, we developed a computational algorithm 
and mapped R-loop forming sequences (RLFS) 
onto 66803 sequences defined by UCSC as 
'known' genes. We found that ~59% of these 
transcribed sequences contain at least one RLFS. 
We created R-loopDB (http://rloop.bii.a-star.edu.sg/),
the database that collects all RLFS identified
within over half of the human genes and links 
to the UCSC Genome Browser for information 
integration and visualisation across a variety of 
bioinformatics sources. We found that many
oncogenes and tumour suppressors (e.g. Tp53,
BRCA1, BRCA2, Kras and Ptprd) and genes related to
neurodegenerative diseases (e.g. ATM,
Park2, Ptprd and GLDC) could be prone to significant
R-loop formation. Our findings suggest that
R-loops provide a novel level of RNA-DNA 
interactome complexity, playing key roles in gene 
expression controls, mutagenesis, recombination 
process, chromosomal rearrangement, alternative 
splicing, DNA-editing and epigenetic modifications. 
RLFSs could be used as a novel source of
prospective therapeutic targets.



INTRODUCTION 

An R-loop is a stable RNA-DNA hybrid structure in which
the RNA strand is base-paired with one DNA strand of a
DNA duplex, leaving the opposite DNA strand
single-stranded. The R-loop structure was first
characterized over 35 years ago (1). Initial studies of
R-loops focused on the development of the 'R-loop
hybridization technique' for visualization of the genetic
organization of ribosomal RNA genes in yeast via electron
microscopy (1-3). The application of this technique also
led to the discovery of introns through the observation of
splicing of adenovirus 2 late mRNA under the electron
microscope (4). Since then, many applications of
R-loop hybridization have been developed, which are
now widely used for the study of gene structure.

In 1995, Drolet and colleagues first demonstrated that
R-loops exist in vivo in bacterial cells (5). In this study,
R-loop formation was shown to be a consequence of
the transcription process, which resulted in hybridization
between the nascent RNA transcript and the DNA template;
this process was therefore called 'co-transcriptional
R-loop' formation. R-loops occur in vivo within sequences
that generate G-rich transcripts at the prokaryotic origins
of replication, in mitochondria and in mammalian
immunoglobulin (Ig) class switch sequences [see (6) for
references]. R-loop forming structures have been documented
in mutant yeast impaired in RNAP II transcription
elongation (7). These and other findings generated interest
in R-loop forming structures and initiated further
studies of R-loops in different cells and species. In
addition, the in vitro techniques of R-loop detection
have been improved and the mechanistic aspects of
R-loop formation have been studied. In this article, we
focus on the analysis of co-transcriptional R-loops
in vivo rather than on the R-loop hybridization technique.

The two possible mechanisms of R-loop formation
proposed by Lieber and Roy are the 'thread back' and
'extended hybrid' mechanisms (6,8,9). According to the
thread back mechanism, a nascent RNA is single-stranded
for a short period of time and then anneals with the
template DNA strand. In the extended hybrid mechanism,
the nascent RNA that forms upon transcription fails to
denature from the template in the transcription bubble,
owing to the high thermodynamic stability of RNA-DNA
hybrids. R-loop formation also requires a
specific pattern of the nucleotide sequence in the DNA
template and the presence of Li+, Na+, K+ and Cs+ ions to
form a stable R-loop structure. R-loop formation
in vivo is a dynamic process involving protein-DNA-RNA
interactions. Top1 (topoisomerase 1) may prevent
an accumulation of negative supercoiling downstream of
a transcription block and can thereby prevent R-loop formation
(10). It was shown that NPH-II helicase can efficiently
unwind an RNA-DNA hybrid containing a purine-rich
DNA tract derived from the 3'-UTR of an early
vaccinia gene (11). A negative correlation between
R-loop formation and the activity of splicing factor
ASF/SF2 in a chicken cell line has been demonstrated by Li
and Manley (12).

In vitro studies showed that R-loop sequences vary in 
length from 150 to 650 bp in Ig switch region (13), from 
110 to 1280 bp in Bcl6 and from 120 to 770 bp in RhoH 
(14). R-loops are sensitive to over-expression of RNase H, 
the endonuclease which specifically hydrolyzes RNA- 
DNA hybrid. Lieber and Roy proposed an R-loop model
that depends on the sequence features and their position. It
includes three distinct parts: R-loop initiation zone (RIZ),
linker and R-loop elongation zone (REZ). They
demonstrated that G clusters in the RIZ are extremely
important for the initiation of R-loop formation (8), but
not in other parts, while the linker between RIZ and REZ
can be of any nucleotide composition. The final part of the R-loop,
REZ sequence, is required to be of high G density but 
does not necessarily have to be a G-cluster. This model 
can be applied for in vivo R-loop detection and facilitate 
the search of potential R-loop forming sequences (RLFS) 
in the genome. 

To date, studies of R-loops have provided
various examples of the significance of RNA-DNA
interactions in a cell. The formation of R-loops during
the replication process in both prokaryotes and eukaryotes
may lead to replication blockage that is lethal if left
unresolved (15). In yeast, inactivation of the THO complex, a
conserved eukaryotic nuclear complex containing the Tho2,
Hpr1, Mft1 and Thp2 proteins, induces R-loop formation
that reduces transcription elongation efficiency
and increases the incidence of hyper-recombination (7).
R-loop formation can also be associated with the occurrence
of transcription-associated recombination (TAR) in yeast
and mammalian cells (16,17). R-loop formation can
initiate various repair systems, such as homologous
recombination (HR), which occurs mainly during late
S phase of the cell cycle (18,19), and non-homologous
end joining (NHEJ), which is involved in antibody
maturation (20). In activated B-lymphocytes of mammals,
R-loops contribute to immunoglobulin class switch
recombination (Ig-CSR) that generates antibody
isotypes (21).

A number of studies proposed and showed that
R-loop formation is involved in transcription-associated
mutation (TAM) (14,22-25). Recent studies
demonstrated a correlation between R-loop formation
and the activity of activation-induced deaminase (AID), an
enzyme that (i) is involved in the generation of
mutations and recombination events in oncogenes, such as
Bcl6 and Myc (14,26), and (ii) may affect genome
instability.

Interestingly, R-loops are often associated with
neurodegenerative diseases, including spinocerebellar
ataxia type 1 (SCA1), myotonic dystrophy (DM1) and
fragile X type A (FRAXA) (22,23,25). R-loop forming
structures can be found in the Fmr1 and Fxn genes, which
are responsible for neurodegenerative diseases (23,25). It
was demonstrated that R-loops could co-localize with
some classes of trinucleotide repeat tracts that occur in
these genes (23). R-loop structures are found when the
Fmr1 and Fxn genes are transcribed. RNA-DNA
hybridization via the R-loop mechanism can generate
genetic instability that may be associated with the
expansion of the trinucleotide repeats within the
disease-related genes (25).

While previous studies outlined several examples of the
functional importance of R-loops, no systematic
analysis has been done at the genome scale. Such an analysis can
facilitate the discovery of new R-loops and their genome
localization, which is helpful for a better understanding of R-loop
structures and their functions, RNA-DNA interactome
complexity and diseases. We hypothesize that R-loops
can be formed in many genes and may play important
roles in a variety of biological processes, including gene
expression regulation, development and cell
communication.

In this work, we first developed a quantitative model of
RLFS and confirmed known RLFS within the genes of the
human genome. We focus on the RLFS in
human genes, because genome mapping, databasing and
visualisation of RLFS integrated with other human
DNA and RNA data could provide a useful tool for
elucidating the role of R-loop formation in
the functional complexity of the genome and its
association with diseases.

Furthermore, we developed a bioinformatics tool for
RLFS search and visualization. Our pipeline identified
RLFS that had previously been discovered in experimental
studies. Based on our computational analysis, we
demonstrate for the first time that RLFS are widespread
throughout the human genome in genes of diverse
functions. We organized our results in the R-loopDB database,
which collects the information about R-loops in each
annotated human gene. R-loopDB facilitates the
interactive and versatile display of R-loops and is
integrated into the UCSC Genome Browser for
integration of information from various sources. We further
demonstrate the potential use of our database in the final part of
this work.

MATERIALS AND METHODS

Data sources 

DNA sequences of the UCSC known genes dataset (the
human genome; hg18 or NCBI Build 36.1) in FASTA
format were downloaded on 23 February 2010. The dataset
included 66 803 UCSC known gene IDs that were
constructed by an automated pipeline from UCSC (27). This
dataset contains RefSeq genes and alternative splicing
variants of each gene.

R-loop forming DNA sequence model 

Based on the experimental study of the characteristics of
R-loop formation by Roy and Lieber (8), we
propose the following computational model of RLFS.
The features of RLFS can be partitioned into three
segments: (i) RIZ, (ii) linker and (iii) REZ, or

RLFS = RIZ+linker+REZ 



RIZ. The DNA regions where R-loops initiate are
considered to be clusters of a few Gs (3-4 nt) in the
region. The segment begins and ends with a
G-cluster that contains at least three contiguous Gs,
e.g. GGG N GGG N GGG. G-clusters are important for
efficient R-loop initiation, and this feature is included in our
model.

Linker. The DNA sequence region between RIZ and REZ 
regions is called linker. The nucleotides in this region are 
not specified in our model. We allow from 0 to 50 nt in the 
linker region. 

REZ. Downstream of the RIZ and linker, the REZ can support
the extension of the R-loop with a high G density (8). The REZ
has to be G-rich but does not require G-clusters like the RIZ.
A G content of at least 40% is required for R-loop formation. In
our model, the length of the REZ can vary from
100 to 2000 nt.

The above model of RLFS is used in our algorithm to 
identify the location of RLFS in the human genes. 
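
The three-segment model above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the pipeline used to build R-loopDB: the function names are hypothetical, the RIZ is assumed to consist of three G-clusters separated by spacers of 1-10 nt (the text only requires that the RIZ begins and ends with a cluster of at least three Gs), and the sequence is scanned on the given strand only. The linker (0-50 nt) and REZ (100-2000 nt, at least 40% G) limits are taken from the model description.

```python
import re

# Assumed RIZ shape: three G-clusters (>= 3 contiguous Gs each) separated by
# spacers of 1-10 nt. The cluster count and spacer length are illustrative
# assumptions, not parameters stated in the text.
RIZ_PATTERN = re.compile(r"G{3,}[ACGTN]{1,10}?G{3,}[ACGTN]{1,10}?G{3,}")


def g_prefix_counts(seq):
    """Return a list where counts[i] is the number of Gs in seq[:i]."""
    counts = [0]
    for base in seq:
        counts.append(counts[-1] + (base == "G"))
    return counts


def find_rlfs(seq, max_linker=50, min_rez=100, max_rez=2000, min_g=0.40):
    """Return one (riz_start, riz_end, rez_start, rez_end) tuple per RIZ hit.

    Coordinates are 0-based and half-open. For each non-overlapping RIZ match,
    the shortest linker (0..max_linker nt) that admits a REZ is used, and the
    longest REZ of min_rez..max_rez nt with G fraction >= min_g is reported.
    """
    seq = seq.upper()
    counts = g_prefix_counts(seq)
    hits = []
    for riz in RIZ_PATTERN.finditer(seq):
        hit = None
        for linker in range(max_linker + 1):
            rez_start = riz.end() + linker
            longest = min(max_rez, len(seq) - rez_start)
            for rez_len in range(longest, min_rez - 1, -1):
                g_count = counts[rez_start + rez_len] - counts[rez_start]
                if g_count / rez_len >= min_g:
                    hit = (riz.start(), riz.end(), rez_start, rez_start + rez_len)
                    break
            if hit:
                break
        if hit:
            hits.append(hit)
    return hits
```

The prefix-count list makes each G-density check constant time, which keeps the nested search over linker and REZ lengths tolerable for gene-sized sequences.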



Database construction 

The results of RLFS identification are collected and
included in our R-loopDB. Presently, R-loopDB is
accessible via http://rloop.bii.a-star.edu.sg/. The database
is managed by a MySQL relational database at the
back-end to support user queries. All HTML pages are
generated by PHP scripts hosted on an Apache server.
The graphical view of gene structure and R-loops is
generated by the Perl Bio::Graphics module. JavaScript
provides interactive interfaces that facilitate site
navigation.

Kolmogorov-Waring statistics and parameterization 

The Kolmogorov-Waring (K-W) probability function
allows description and understanding of evolution
patterns in the stochastic birth-death process in complex
evolved systems. At near steady state of the linear
birth-death stochastic process, the K-W function
can be calculated via the following simple recursive
formula (28):

$$\frac{p_{m+1}}{p_m} = \theta \, \frac{a+m}{b+m+1} \qquad (1)$$

where m = 0, 1, 2, ..., M [M = max(m)]. The inequalities
b + 1 > a > 0 and θ < 1 provide the necessary and sufficient
conditions for the stable steady-state behaviour of the
random process (28,29). We estimated the parameters a, b
and θ by the method reported in (28).
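
As a minimal sketch of how the recursion in Equation (1) can be evaluated, the following Python function builds the probabilities p_0, ..., p_M from the ratio p_{m+1}/p_m. The function name is hypothetical, and normalising over the truncated range 0..M (and requiring θ > 0) are assumptions of this sketch; parameter estimation as in (28) is not reproduced here.

```python
def kw_probabilities(a, b, theta, max_m):
    """Return [p_0, ..., p_M] with p_{m+1} = theta * (a + m) / (b + m + 1) * p_m.

    The probabilities are normalised to sum to 1 over m = 0..max_m, which is
    an assumption of this sketch (a truncated support).
    """
    if not (a > 0 and b + 1 > a and 0 < theta < 1):
        raise ValueError("expected b + 1 > a > 0 and 0 < theta < 1")
    weights = [1.0]  # unnormalised p_0
    for m in range(max_m):
        weights.append(weights[-1] * theta * (a + m) / (b + m + 1))
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Parameter values quoted in the Figure 3 caption of this article.
    probs = kw_probabilities(a=1.90, b=3.83506, theta=0.9905, max_m=200)
    print(probs[:5])
```

Because the ratio in Equation (1) is below 1 whenever b + 1 > a and θ < 1, the resulting sequence decreases monotonically, which gives the skewed, long-tailed shape described for the RLFS counts.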



Querying the database 

R-loopDB provides user-friendly access with
multiple search options (Figure 1B) that allow the user to
input an official gene symbol, gene family keyword,
RefSeq ID, gene description keyword, known gene ID
or chromosome band as the query term. We recommend
that users who are interested in a specific alternative
splicing variant use the known gene ID as input. Besides
searching for genes of interest, R-loopDB provides the
additional feature of filtering for genes that contain RLFS in
the first exon or the first intron. This might be important
because an R-loop could be formed when the RLFS is
located within the 5'-end region of a gene, and the efficiency of
R-loop formation is reduced in the distant downstream
regions of the gene (9). This optional filter is located in
the gene search box. Users who are interested in finding RLFS
located near the 5'-end region are advised to use this
option.

The 'search result' page (Figure 1C) is designed in
table format with three fields: gene symbol, gene
description and chromosome band. The user can click on a
gene symbol link to view the detail page for that particular
gene.



Output interface 

R-loopDB allows visualization of RLFS in the selected
gene (Figure 2) as (i) a gene map (Figure 2A); (ii)
details of the RLFS sequence structure (Figure 2B); (iii)
RLFS mapped on the UCSC browser known gene
(Figure 2C) and (iv) annotation of the gene by NCBI
search (Figure 2D). The user can navigate to any RLFS
(see green box in Figure 2A) located in a region of
the gene of interest and see details of the RLFS sequence
as shown in Figure 2B. This figure provides highlighted
sub-sequences of the RLFS, including RIZ, linker, REZ and G
(guanine)-cluster (see 'Materials and Methods' section).
To ensure that users interested in R-loops can conveniently
find a wide range of information for genes of interest, we
provide links to external databases including the UCSC
Genome Browser and NCBI Entrez Gene. This enables
integration of other information on genomic context,
expression data and updated information for the gene of
interest.

Figure 1. R-loop forming structure and representative screenshots of R-loopDB. (A) Transcription with and without R-loop forming structure.
R-loop initiation zone (RIZ) and R-loop elongation zone (REZ) are highlighted in yellow and blue, respectively. (B) The search bar. (C) The search result
for the Bcl6 gene.



RESULTS AND DISCUSSION 

Data validation 

To validate our findings, we compared predictions from
our model with previously reported data describing
R-loop-positive and R-loop-negative genes. Previously,
R-loop structures have been detected in only a few
mammalian genes: Ig switch region, Bcl6, Myc, Rhoh,
Fmr1 and Fxn (14,21,23,25,26,30). In two other genes,
Ig variable heavy chain and a-Myb, no R-loop
structure has been reported in the gene regions (14). We
compared our prediction results with the experimental
data for these genes, and the results were completely
consistent with the observations. This suggests that our
RLFS identification method produces reliable results.

[Figure 2B panel: example RLFS sequence. RIZ (yellow), REZ (cyan), linker (no highlight), G-cluster (red).]

Figure 2. Snapshot of representative R-loopDB result pages for the Bcl6 gene. (A) Overview figure that shows all known transcripts of the gene and
RLFS mapping results. (B) Detailed summary of the RLFS (in green box of A), including sequence structure, location, length and G-cluster.
(C) Link from RLFS mapping result to UCSC database tracks (URL: http://genome.ucsc.edu/cgi-bin/) (in red box of A). (D) Link from RLFS
mapping result to NCBI Entrez gene database (URL: http://www.ncbi.nlm.nih.gov/gene/) (in blue box of A).

Figure 2 shows an example of the analysis of RLFSs within
the Bcl6 gene region. Panel A shows that five RLFSs can be
found in this gene region and that all five are
located in the first intron. Panel B provides a detailed
visualization of an RLFS, showing the location of the
RIZ at the 5'-end of the sequence and the REZ at the
3'-end of the sequence. In this figure, G-clusters are
highlighted. Panel C shows the results of our application
integrated in the UCSC browser viewer. This integration
allows users to connect information about RLFS
localization with the many annotation tracks available in the UCSC
browser, which provide further information, such as
intron or exon localization of RLFS and co-localization of
RLFS with important regulatory signals [histone
methylation, CpG islands, repeat elements, transcription
factor-binding sites (TFBSs), etc.]. Panel D provides
characteristics of the gene of interest (Bcl6) via a link to the NCBI
Entrez gene annotation list.

Prevalence of R-loops in the human genes 

In total 66 803 sequences of UCSC known genes and splice 
variants were downloaded and studied. We found that 
59% (39 720/66 803) of UCSC known genes and their 
splice variants contain at least one RLFS. We then 
counted the number of RLFS in each UCSC known 
gene sequence. Overall, 245 181 RLFSs from 39 720 
UCSC known gene sequences were found and stored in 
the R-loopDB. 

To prevent over-counting of RLFS location events in
our further statistical analysis, we merged overlapping
RLFSs sharing at least 1 nt into a single longest DNA
segment. After merging of overlapping RLFSs, the number of
RLFSs is 140 106. Figure 3A demonstrates that the
frequency distribution of the number of such RLFSs follows
a skewed power-law-like frequency distribution, which
can be described well with the K-W birth-death evolution
model (28). This function is used for statistical
characterisation of the frequency distribution of the
occurrence of diverse structurally and functionally important
signals, for instance TFBSs in a gene promoter region of
a given eukaryotic genome (29) or domains or structural
motifs in a protein of a given proteome (28). Frequency
distributions of this type are sample size-dependent (not
scale-free) and arise naturally in complex organisms
in the course of evolution as the result of positive
selection of 'useful' structural/functional elements (28).
Figure 3A suggests that the evolution of the RLFSs follows
a similar statistical rule.
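
The merging step described at the start of this paragraph (overlapping RLFSs sharing at least 1 nt are collapsed into the single longest segment) can be sketched as follows. The function name and the use of 1-based inclusive coordinates are assumptions for illustration, and all intervals are assumed to lie on the same sequence and strand.

```python
def merge_overlapping(intervals):
    """Merge (start, end) intervals that share at least 1 nt.

    Coordinates are assumed to be 1-based and inclusive, so two intervals
    overlap when the next start is <= the current end. Adjacent intervals
    that merely touch (end + 1 == next start) are kept separate.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the last merged segment to the furthest end seen so far.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For example, `merge_overlapping([(1, 10), (10, 20), (21, 30)])` collapses the first two intervals, which share position 10, and leaves the third untouched.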

We also analysed the frequency of RLFS in each UCSC
known gene and its splice variants. The distribution of
RLFS per gene is shown in Figure 3A. We found that
~60% of UCSC known gene and splice variant sequences
contained only one or two RLFS. However, some genes
and their isoforms carry a very large number (>100) of
RLFS (Figure 3A and B). The eleven UCSC known gene
sequences containing more than 100 RLFSs are
represented by four gene IDs: IgH (14q32), Ptprn2 (7q36),
Mad1l1 (7p22) and Sorcs2 (4p16). IgH, Ptprn2, Mad1l1
and Sorcs2 have 105, 140, 104 and 115 RLFSs, respectively.

RLFSs occur multiple times in 35% of known genes and 
their splice variants 

Interestingly, RLFSs occur in 16 362 known genes and 
their splice variants only once, whereas 35% (23 358/ 
66 803) of the 66 803 genes and their splice variants 
contain multiple RLFS (Figure 3). This finding implies 
that multiple occurrences of RLFS may play important 
roles in gene expression regulation. 

Immunoglobulin class switch recombination (Ig-CSR)
is the process in which IgM changes to IgG, IgA or IgE
by DNA rearrangement of the Ig heavy chain from IgHμ
to IgHγ, IgHα or IgHε (31). It occurs at class switch
sequences located upstream of the corresponding constant
domain exons. It was demonstrated that R-loops form at
Ig-CSR regions in activated B lymphocytes. According to
the R-loop model, inversions of switch regions reduce their
efficiency (32). It was suggested that R-loop structures
are necessary for enhancing the CSR process. In
particular, IgH is one of the activated B lymphocyte
genes in which R-loop formation was reported (21). Our
analysis reveals 105 RLFSs in IgH. We suggest that the
abundance of R-looping regions may play an important role in
Ig-CSR.

Figure 3. Statistics of RLFS in a gene of the human genome. (A) Numerical characteristics of RLFS distribution. (B) Observed frequency distribution
of RLFS in a gene of the human genome and its fitting by K-W probability function (see 'Materials and Methods' section). This model fits empirical
frequency distribution at θ = 0.9905; a = 1.90, b = 3.83506.

We also found that Mad1l1, Ptprn2 and Sorcs2, like
IgH, are highly abundant in RLFSs (Figure 3B). It has
been reported that copy number gains and losses in
Mad1l1, Ptprn2 and Sorcs2 can be associated with
various diseases (33-39). Previous studies also suggested
an association between R-loop formation and mutations
in non-Ig genes (14,26). We used the COSMIC database
(URL: http://www.sanger.ac.uk/genetics/CGP/cosmic/)
to determine mutations in the Mad1l1, Ptprn2 and Sorcs2
genes across cancer tissue samples. We found mutations
in Mad1l1 and Sorcs2 in glioma patient samples, and
mutations in Ptprn2 in ovarian cancer patient samples.
We analysed the distances between the mutated sites in these genes
and the locations of RLFS. Interestingly, mutated sites and
RLFS locations overlap in Mad1l1 and are in close
proximity in Sorcs2 (0.98 kb) and Ptprn2 (0.29 kb). These
findings suggest that R-loops may contribute to
mutagenesis in these genes and that an abundance of R-looping regions
might raise the risk of mutagenesis. R-loop-mediated
mutagenesis and its link with single nucleotide
polymorphisms (SNPs) and recombination events remain
an interesting field for further investigation.
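
The distance analysis between mutated sites and RLFS locations described above can be illustrated with a small helper. This is a hedged sketch rather than the procedure used in the study: the function name is hypothetical, coordinates are assumed to be 1-based and inclusive on a single chromosome, and the coordinates in the usage note are invented for illustration only.

```python
def distance_to_nearest_rlfs(position, rlfs_intervals):
    """Return the distance in nt from a mutated position to the nearest RLFS.

    Returns 0 when the position lies inside an RLFS (overlap) and None when
    no RLFS intervals are given. Intervals are (start, end), inclusive.
    """
    best = None
    for start, end in rlfs_intervals:
        if start <= position <= end:
            return 0
        gap = start - position if position < start else position - end
        if best is None or gap < best:
            best = gap
    return best
```

With the made-up intervals `[(100, 200), (1000, 1500)]`, a site at position 150 gives a distance of 0 (overlap), while a site at position 490 gives 290 nt to the nearer interval.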

RLFSs can be co-localized with mutation and 
recombination regions 

Single-stranded DNA associated with a persisting R-loop is
less protected from mutagens and thus contributes to the
occurrence of TAMs, including single-base substitutions,
insertions and deletions. To find evidence for mutations
caused by R-loop formation, we integrated SNP data
from the dbSNP database (40) with RLFSs. We found that
SNPs could be localized in RLFS regions. In particular,
Figure 4A shows that SNPs in the first exon of Krt14 are
strongly enriched within an RLFS, and thus this RLFS could
be associated with TAM. Interestingly, among the SNPs,
there are four non-synonymous (i.e. resulting in amino
acid changes) SNPs: rs28928893 (41), rs60171927 (42),
rs60399023 (43) and rs58330629 (43). Each of these
SNPs is known to cause epidermolysis bullosa simplex
(43). This finding may give insight into the association
of R-loop formation with disease-causing mutations.

Besides mutations, R-loops could also be linked to TAR
(16,17). When DNA replication and RNA synthesis are
co-directional, an R-loop can cause replication fork
stalling and collapse, thus inducing DNA strand breaks.
To reduce the impact of DNA breaks, a DNA repair
system, such as template switching via the homologous
recombination process, can be activated (44,45). In
mammalian B lymphocytes, R-loops and AID can trigger class
switching in the Ig gene to form DSBs, which in turn cause
chromosomal translocations via NHEJ (16).



Besides the Ig gene, R-loops can also be detected in
oncogenes (e.g. Bcl6 and Myc), providing a link to such
hallmarks of cancer as hypermutation and genome
rearrangement (14). Defects in the repair of DNA strand
breaks underpin many hereditary diseases, such as
neurodegeneration and immune dysfunction (46). In
addition, recombination is not a risk-free event; for
example, there is a chance of loss of heterozygosity
(LOH), which may eventually lead to the development of
cancer and other genetic diseases. We suggest that our
database could be useful for finding important associations
between RLFS and such types of genome abnormalities.
We assume that R-loops can initiate recombination during
late S phase of the cell cycle and contribute to
AID-dependent translocation of many oncogenes.

To elucidate the association between RLFS and TAR
phenomena, we integrated R-loop data with recombination
breakpoint data from (i) a replication-induced
recombination data set (47) and (ii) an AID-dependent translocation data
set (26). The data set (47) contains the chromosome
locations of breakpoints found in Top1-deficient human
colorectal carcinoma cells. Top1 is a key enzyme that plays an
important role in the removal of DNA supercoiling
associated with replication and transcription, leading to
suppression of genomic instability by preventing
interference between replication and transcription. The authors
found that Top1-deficient cells accumulated stalled replication
forks and recombination breakpoints in S
phase. In the absence of Top1 protein, defective RNA
processing leads to the formation of R-loops, which could
block fork progression and finally generate DNA
breaks. By over-expressing exogenous RNase H1 in the
Top1-deficient cells, the authors produced evidence that
degradation of RNA-DNA hybrids prevents R-loop
formation during gene transcription.

We compared the regions of breakpoints (47) in
transcribed genes with predicted RLFSs. We found
overlaps of breakpoint and RLFS regions in several
cancer-associated genes. For instance, Figure 4B shows a
chromosome map of Foxo3 as an example of
co-localization of predicted RLFSs and experimentally
induced replication-induced recombination breakpoints
(48). Foxo3 belongs to the O-subclass of the forkhead
family of transcription factors that protect cells against a
wide range of physiological stresses and is known as a
tumour suppressor. Foxo3 has recently been reported to
be a novel target of deletion in human lung
adenocarcinoma (48). The Foxo3 deletion regions co-localize with the
lung adenocarcinoma replication-induced recombination
breakpoint region and with RLFSs defined by our model.
These findings suggest a causal role of R-loop formation
in the generation of replication-induced recombination
breakpoints. One more compelling example is the
co-localization of R-loops with the deletion regions of the
glycine dehydrogenase (GLDC) gene. GLDC is a
component of the multiple-enzyme glycine cleavage
system involved in the major pathway for degradation of
glycine. Deletion in this gene is a major cause of
non-ketotic hyperglycinaemia, an inborn error of glycine
metabolism characterized by accumulation of glycine in
body fluids leading to various neurological symptoms
(49). However, the precise mechanism of deletions in
GLDC has not been elucidated. Recently, the sequence
boundaries of the deletion regions in GLDC were
identified (49). It was found that most 5'-end deletion
breakpoints were located within the 5'-end region of the gene: 72%
(18 out of 25) of the 5'-end deletion breakpoints include exon 1 to
exon 4 of the 25 GLDC exons (49). We found 10 RLFSs; all
of them were clustered within exon 1, intron 1, intron 2
and intron 4 (see 'GLDC' in R-loopDB). Our database
search result suggests that R-loop-mediated
recombination in GLDC could be related to the mechanisms
causing non-ketotic hyperglycinaemia.

Figure 4. R-loops co-localization with mutations and recombination regions. (A) R-loop association with transcription-associated mutation (TAM).
The first annotation track illustrates SNPs location retrieved from dbSNP build 130 (40). SNPs are enriched in RLFS (pink colour) of Krt14 gene.

Another piece of evidence supporting a direct association
of our RLFS models with translocation breakpoints is
the study of AID-dependent translocation breakpoints
of the Myc gene reported by Duquette et al. (26).
Translocations of Myc to the Igh switch regions are
typical for sporadic Burkitt's lymphomas (50,51).
However, Igh-Myc translocations were
detected only in wild-type, but not in AID-deficient,
IL6-transgenic mice, implying involvement of AID in
Igh-Myc translocation (52). Importantly, Duquette et al.
reported the in vitro formation of R-loops in the Myc gene.
AID requires an ssDNA substrate, which can be generated by
an R-loop. To validate and show the association of R-loops
with AID-dependent translocation breakpoints, we
compared breakpoints of the Myc gene with computationally
predicted RLFSs. Figure 4C demonstrates that our
model predicted RLFSs in the region overlapping
AID-dependent translocation breakpoints and located near the
translocations identified in Burkitt's lymphoma tissues
and cell lines. These data support a causal role of R-loop
formation in the generation of AID-dependent translocation
breakpoints.

Table 1. Regions where RLFSs co-localize with splice sites of Sorcs2
gene

RLFSs can be involved in alternative splicing

The connection between R-loop formation and the activity of
splicing factor ASF/SF2 in a chicken cell line has been
demonstrated by Li and Manley (12). The authors
reported the unexpected finding that genetic inactivation
of the ASF/SF2 splicing factor, which is essential for
the alternative splicing process, resulted in R-loop
formation. The observation that ASF/SF2 protein prevents
R-loop formation suggests a function of ASF/SF2 protein
in pre-mRNA processing and the location of R-loop
formation next to the splice sites (53). However, the
association between R-loop formation and splicing factor
activity in the human genome is not clear. Linking
R-loopDB and the UCSC genome browser allows users
to study associations between RLFS and various signals
important for gene expression and genome alterations.
Besides alterations at the DNA sequence level, it
may be interesting to study the connection between RLFS and
the alternative splicing process. As an example of this kind of
analysis, we studied the localization of the RLFSs and the
splice sites in Sorcs2 via the UCSC genome browser
integration. We explored the location of RLFS in this gene and
found that RLFS overlapped with two start sites
immediately after the first exon (Figure 5A). We also found
15 additional regions where RLFSs co-localize with splice sites
of the Sorcs2 gene. Output from our analysis with
co-localization information is presented in Table 1.
An association of R-loop formation with the exon skipping
mechanism could be considered to support our findings.
Figure 5B shows an example of such an association. This is
the first evidence of R-loop-mediated mRNA splicing in
human genes.

RLFSs in cancer and neurodegenerative diseases 
related genes 

Besides previously reported genes, we also identified novel
RLFS in more than 200 important genes associated with
cancer, e.g. Tp53, BRCA1/BRCA2 and Kras (Figure 6A), and in
genes common to central nervous system and
neurodegenerative diseases, e.g. ATM, Park2 and Ptprd
(Figure 6B). According to our study, the R-loop forming
mechanism can be associated with other cell types and
diseases (data not presented). Information about RLFS
abundance in the above-mentioned genes is presented in
Figure 3B. This figure shows that genes related to
cancer and neurodegenerative disease have low abundance
of RLFS. Figure 5A and B confirms our discussion of
RLFS co-localization with alternative splicing sites.
Interestingly, several genes linked to cancer are also
targets of the mutator enzyme AID.
RLFSs as possible targets of epigenetic 
reprogramming 

RLFS can result in extension of the transcription bubble
on the non-template DNA strand and may play an important
role in gene modification and epigenetic reprogramming.
Activation-induced cytidine deaminase/apolipoprotein B
RNA-editing catalytic component (AID/APOBEC) is a
group of enzymes capable of editing nucleic acids
through deamination of cytosines to uracils. Recent
discoveries indicate that AID is critical for epigenetic
reprogramming in mammals (54,55). AID needs an ssDNA
substrate, and thus the R-loop forming mechanism could
provide a substrate for AID. This enzyme is active in
primordial germ cells (PGCs) and in early embryos,
where demethylation occurs. The level of methylation
was found to be up to three-fold higher in AID-deficient
PGCs than in wild-type PGCs (54,55).
AID-mediated demethylation occurred throughout the
genome at specific target regions rather than globally,
and the mechanism regulating this demethylation is
unknown. We hypothesize that the R-loop structure may be
a potential target of AID-mediated epigenetic
reprogramming.
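
The substrate argument above can be illustrated with a toy model. This is a sketch under strong assumptions: the displaced non-template strand is treated as fully single-stranded across the whole RLFS, every cytosine in it is treated as equally accessible, and the function names and the sequence are hypothetical. It illustrates only the cytosine-to-uracil conversion, not the actual target specificity of AID.

```python
from typing import List


def cytosine_positions(non_template: str, start: int, end: int) -> List[int]:
    """0-based positions of cytosines in the half-open window [start, end)."""
    seq = non_template.upper()
    lo, hi = max(0, start), min(end, len(seq))
    return [i for i in range(lo, hi) if seq[i] == "C"]


def deaminate_displaced_strand(non_template: str, start: int, end: int) -> str:
    """Convert every C in [start, end) to U, leaving the rest unchanged.

    The window stands for the unpaired strand of a hypothetical R-loop;
    bases outside it are assumed to stay double-stranded and protected.
    """
    seq = non_template.upper()
    lo, hi = max(0, start), min(end, len(seq))
    return seq[:lo] + seq[lo:hi].replace("C", "U") + seq[hi:]


# Toy sequence; the R-loop window covers positions 3..7.
strand = "ACGTCCGGAC"
print(cytosine_positions(strand, 3, 8))
print(deaminate_displaced_strand(strand, 3, 8))
```

Counting candidate cytosines per predicted RLFS in this way gives only an upper bound on possible deamination sites under the stated assumptions.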

To support this hypothesis, we examined the co-localization
of RLFSs in the Dazl and Foxo1 genes. These genes are
known to become demethylated during PGC development
and to be more highly methylated in AID-deficient PGCs (55).
A recent study has shown that incorrect DNA
methylation of the Dazl gene is associated with defective
human sperm (56). Figure 7 demonstrates that the
predicted RLFSs of the Dazl and Foxo1 genes are located in
the demethylated area processed by AID. Interestingly,
RLFSs are co-localized with the first intron and CpG
islands of both genes. These findings, and our other
observations obtained with the R-loopDB search tool, imply
an association of RLFS with epigenetic modification and
with transcription initiation and elongation. Thus, our
preliminary study using R-loopDB suggests that (i) RLFS is
associated with AID activity, which may be functional not
only in Ig genes but also in other genes related to
epigenetic reprogramming, and (ii) the RLFS model should be
used in future studies of the role of the R-loop forming
mechanism in AID-mediated epigenetic reprogramming. Other
interesting applications of predicted
RLFSs (and R-loop formation) may be relevant to the
mechanisms that underlie RNA-directed transcriptional
gene silencing (57) and Dnmt1-mediated DNA
methylation in non-CpG context in DNA bubbles leading
to silencing of DNA replication and transcriptionally
active loci (60).

Figure 5. RLFS associated with splice variants and exon skipping sites. (A) RLFS located near splice sites of Sorcs2 gene.

Figure 6. RLFSs associated with essential genes. (A) RLFSs on cancer-related genes. (B) RLFSs on genes related to the central nervous system and neurodegenerative diseases.

Future experimental and technological approaches to 
analysis of RLFS and R-loops 

The formation of R-loops using short RNA probes having
RIZ and REZ sequences, predicted and collected in our
R-loopDB, can have several technological applications.
Using computationally predicted RNA sequences, a
method for directing the enzymatic double-stranded
scission of RLFS DNA could be developed. A protocol
for such an 'R-loop-extraction assay' should consist of the
following steps: (i) sequence-specific R-loop formation;
(ii) chemical modification of the displaced single strand
of DNA with base-specific modification reagents to
stabilize the R-loop, such as neomycin (61-64), and block
renaturation of DNA; (iii) hydrolysis of the RNA used
for R-loop formation to render both DNA strands
sensitive to scission/cleavage at either end of the
single-stranded bubble formed by R-loop formation;
(v) amplification of the specific RLFS DNA and
(vi) computational analysis of the reaction products.
Finally, using next-generation sequencing (NGS), such a
method could be implemented as a highly specific assay to
study the structural and functional roles of naturally
occurring and artificially generated R-loop forming
sequences in individual human genes, different gene groups
and genome regions.

CONCLUSION 

In this work, we described a quantitative model of RLFS 
and created the R-loopDB, the first database of RLFS 
intended for detailed investigation of their sequences, 
location and RLFS-containing genes. Our web
implementation supports various types of queries that
allow users to find not only genes of interest, but also
their splice variants and the regions of epigenetic
modifications associated with RLFSs. These regulatory
signals can provide a novel understanding of gene
expression regulation and of the complexity of RNA-DNA
interactions in genome and transcriptome functions.
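
As an illustration of how a predicted region can be handed over to the UCSC Genome Browser, the sketch below builds a browser link from a chromosome and a coordinate range. It assumes the public `hgTracks` endpoint with its `db` and `position` parameters. The function name, the default assembly `hg19` and the example coordinates are assumptions made for illustration and do not describe how R-loopDB itself constructs its links.

```python
from urllib.parse import urlencode

UCSC_HGTRACKS = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def ucsc_link(chrom: str, start: int, end: int, assembly: str = "hg19") -> str:
    """Return a UCSC Genome Browser URL centred on chrom:start-end.

    The assembly name is an assumption; use the one that matches the
    coordinates of the region being viewed.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    query = urlencode({"db": assembly, "position": f"{chrom}:{start}-{end}"})
    return f"{UCSC_HGTRACKS}?{query}"


# Hypothetical region, not an actual RLFS from the database.
print(ucsc_link("chr4", 7000000, 7010000))
```

Opening such a link shows the region alongside the other annotation tracks of the browser, for example splice variants and CpG islands.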

The prediction of RLFSs in over half of the human 
genes reveals a novel level of RNA-DNA interactome 
complexity that perhaps will lead to a better understand- 
ing of the role of R-loop forming structure in gene 
expression controls and epigenetic modifications. The 
specific conformation of RNA-DNA hybrid formation 
also provides a unique target for controlling the transfer 
of genetic information through binding by small
molecules. R-loop studies show that RNA can interact
with DNA, generating a few beneficial effects and many
harmful effects in cells. In our study, we provide
biological insights into the R-loop structure in several
molecular machineries. In particular, our findings
suggest that (i) over half of transcripts contain
at least one RLFS, indicating that RLFSs present a
common regulatory element essential for gene expression
controls and epigenetic modifications; (ii) multiple
occurrences of RLFS in essential genes suggest a specific
role of RLFS in these genes; (iii) R-loops may be directly
involved in the alternative splicing process; (iv) mutation
and genome variations may be associated with R-loop
formation and (v) RLFS may help AID in epigenetic
reprogramming in development. Finally, our database
provides comprehensive analysis of R-loops in essential
genes related to cancer, neurodegenerative diseases and
many genetic diseases.

We provide a workflow for an R-loop extraction assay,
which could be used for implementation of our
R-loopDB predictions. Identification of RLFS in
personal human genomes and in mammalian and
non-mammalian species, as well as analysis of the
conservation and evolution of RLFS, will be addressed in
a future study.

We found that RLFSs are widespread, occurring in over
half of the genes of the human genome. R-loopDB
provides the first comprehensive catalogue of RLFS,
which could be used in systematic studies of the
structures and functions of R-loops in normal and
abnormal cells, as well as in the drug industry and
clinical research applications. We expect that R-loopDB
will help researchers in R-loop analysis and in the design
of experiments aimed at discovering mutated sites and
epigenetic modifications in RLFS-containing genes.
We also believe that R-loopDB will be useful for drug
discovery and identification of new classes of therapeutic
targets.